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Abstract — In this paper, we propose a Bayesian Hypothesis 
Testing Algorithm (BHTA) for sparse representation. It uses 
the Bayesian framework to determine active atoms in sparse 
representation of a signal. 

The Bayesian hypothesis testing based on three assumptions, 
determines the active atoms from the correlations and leads to the 
activity measure as proposed in Iterative Detection Estimation 
(IDE) algorithm. In fact, IDE uses an arbitrary decreasing 
sequence of thresholds while the proposed algorithm is based on 
a sequence which derived from hypothesis testing. So, Bayesian 
hypothesis testing framework leads to an improved version of 
the IDE algorithm. 

The simulations show that Hard-version of our suggested 
algorithm achieves one of the best results in terms of estimation 
accuracy among the algorithms which have been implemented in 
our simulations, while it has the greatest complexity in terms of 
simulation time. 

Index Terms-Sparse representation, Compressed sensing, 
Sparse component analysis, Blind source separation, Bayesian 
approaches, Pursuit algorithms. 



I. Introduction 

Finding (sufficiently) sparse solutions of underdetermined 
systems of linear equations (possibly in the noisy case) has 
been used extensively in signal processing community. This 
problem has found applications in a wide range of diverse 
fields. Some applications are Blind Source Separation (BSS) 
and Sparse Component Analysis (SCA) [1], [2], decoding 
[3], image de-noising [4], sampling and signal acquisition 
(compressed sensing) [5], [6] and regression [7]. 

The problem can be stated in various contexts such as sparse 
representation, SCA or Compressed Sensing (CS). Here, we 
use the notation of sparse representation of signals. Let the 
model be: 

x = <fry + e. (1) 

where x is an n x 1 signal vector, y is an m X 1 sparse 
coefficient vector, <& is an n x to matrix called dictionary 
and e is a n x 1 error vector. It is assumed that n < m 
which means that the signal length is smaller than the number 
of columns of the dictionary (which are called atoms [8]). 
So, the number of columns of the dictionary is more than 
the number of rows of the dictionary, that is, the dictionary 
is overcomplete. The main assumption is that the signal has 
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a sparse representation in this overcomplete dictionary. The 
main goal is to find the sparse coefficient vector y based on 
the signal x and knowing the dictionary <fr. This problem 
is nominated as sparse representation of the signal and the 
methods are called sparse representation or sparse recovery 
algorithms. 

According to applications, the vector interpretations are 
different, but in all of them the model follows (1). For example, 
in the context of CS, is the measurement matrix, x is a 
vector whose the few components are measurements of the 
signal and y is the sparse representation of the true signal. In 
the context of SCA, $ is the mixing matrix, x is the mixture 
vector and y is the source vector. 

Because of n < to, there are usually infinitely many 
solutions of this underdetermined system of linear equations. 
In the exact sparse representation case, if we restrict ourselves 
to sufficiently sparse coefficient vectors, it is proved that under 
some conditions the sparsest solution is unique [9], [10], 
[11]. In the noisy case, there are theoretical guarantees (in 
terms of sparse coefficients and dictionary) for accurately and 
efficiently solving the problem [12], [13]. 

Finding the sparsest solution, that is, the solution with 
the minimum number of nonzero elements, is an NP-hard 
combinatorial problem. Different methods have been proposed 
to solve the problem in a tractable way. Most of them can be 
divided in two main categories: 1) Optimization approaches 
and 2) Greedy approaches (or pursuit algorithms). The first 
category solves the problem by optimizing a cost function 
according to different methods. The second set of methods 
tries to find active coefficients (with nonzero elements) directly 
through an algorithm. 

The optimization approaches are basically split into convex 
and non-convex optimization methods. The most successful 
approach which is Basis Pursuit (BP) [14], suggests a convex- 
ification of the problem by replacing the ^°-norm 1 with the 
£ 1 -norm. It can then be implemented by Linear Programming 
(LP) methods. Recently, a Gradient-Projection algorithm for 
Sparse Reconstruction (GPSR) is used for bound-constrained 
quadratic programming formulation of these problems [15]. 
A method for large scale I 1 -Regularized Least Square (l 1 - 
RLS) is also devised in [16]. In addition, an Iterative Bayesian 
Algorithm (IBA) is used for solving the problem with a convex 
cost function which its steps resemble E-step and M-step of 
an EM algorithm [17], [18]. 

(P norm of a vector is defined as the number of its non-zero components. 
Although it is not a mathematical norm, we use this name because it is 
frequently used in the literature. 
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Among the nonconvex cost function methods, the FOCUSS 
algorithm uses i? p -norm with p < 1 instead of ^°-norm 
in the noise-free case [9], [19]. Regularized-FOCUSS (R- 
FOCUSS) method extends FOCUSS for the noisy case with 
a Bayesian framework [20]. There are also some Bayesian 
methods such as Relevance Vector Machine (RVM) [21], 
Sparse Bayesian Learning (SBL) [22] and recently a Bayesian 
Compressive Sensing approach (BCS) [23] which mainly solve 
a nonconvex problem. Recently, a smoothed version of the £°- 
norm was used for solving the problem by a gradient-ascent 
method which is called Smoothed-^°(SL0) [24]. Moreover, a 
Sparse Reconstruction by Separable Approximation (SpaRSA) 
algorithm is suggested for group separable regularizes which 
is usually nonsmooth and possibly also nonconvex [25]. There 
is also an Iterative Reweighted Algorithm for nonconvex CS 
(IRA) [26]. 

The other category is the greedy algorithms which choose 
successively the active coefficients without having any explicit 
cost function. Generally, they use the correlation between the 
signal (or residual signal) and the atoms of the dictionary as 
an informative measure for deciding which coefficients are 
actually active (or nonzero). These algorithms are Matching 
Pursuit (MP) [8], Orthogonal Matching Pursuit (OMP) [27], 
Stage-wise OMP (StOMP) [28], Weighted MP (WMP) [29], 
Tree-Based Pursuit (TBP) [30], Regularized OMP (ROMP) 
[31], Gradient Pursuit (GP) [32], Stagewise weak Gradient 
Pursuit (StGP) [33] and Compressive Sampling MP (CoSaMP) 
[34]. 

Besides these two main approaches, one can mention Iter- 
ative THresholding algorithms (ITH) [35], an Iterative Detec- 
tion Estimation (IDE) method [36] and three Minimum Mean 
Square Estimation (MMSE) algorithms [7], [37] which can be 
considered as Bayesian approaches which use discrete search 
techniques for finding the dominant posteriors. In [7], an 
algorithm is proposed for Approximating the MMSE estimate 
(A-MMSE) for the sparse vector in the application of linear 
regression. [37] also presents a Fast Bayesian Matching Pursuit 
(FBMP) method for recursive MMSE estimation in linear 
regression models. 

Table I shows the overall sparse representation algorithms 
that we have mentioned. In this table, Bayesian methods 
are highlighted with bold characters. One can consider the 
Bayesian methods as a distinct category, but we did not do 
that because they also need some kind of cost function and 
optimization techniques or algorithms to solve their problem. 

An important task in the sparse representation is to deter- 
mine which coefficients are nonzero or in other words which 
atoms are active in the sparse representation of the signal. 
This is mainly done with Correlation Maximization (CM) in 
the pursuit algorithms with some differences. So, the core 
idea of the pursuit algorithms is to use the correlation of the 
residual signal with the atoms to determine the active atoms. 
For example, MP uses the CM to select at each iteration one 
active atom. StOMP uses a thresholding to select several active 
atoms at a same time. Other methods like IDE, use a measure 
of activity to determine the corresponding nonzero coefficients. 
The IBA algorithm [17] uses a steepest-ascent to determine a 
vector which is defined as the activity vector. 



TABLE I 

Sparse Representation algorithms ( bold names are Bayesian 
approaches). 



Optimization algorithms 


Greedy 


Other 


Convex 


Nonconvex 


Algorithms 


Algorithms 


BP [14] 


FOCUSS [9], [19] 


MP [8] 


ITH [35] 


GPSR [15] 


R-FOCUSS [20] 


OMP [27] 


IDE [36] 


£ 1 -RLS [16] 


RVM [21] 


StOMP [28] 


A-MMSE [7] 


IBA [17] 


SBL [22] 


WMP [29] 


FBMP [37] 




BCS [23] 


TBP [30] 






SL0 [24] 


ROMP [31] 






SpaRSA [25] 


GP [32] 






IRA [26] 


StGP [33] 








CoSaMP [34] 





The simplicity of the greedy algorithms or pursuit algo- 
rithms arises in determining one active atom (e.g., in MP) 
or several active atoms (e.g., in StOMP) at an instant. So, 
they determine the activity vector in a simple way rather than 
to solve a hard optimization problem in a multi-dimensional 
space. The basic idea of this paper is to use the correlation 
between the signal and atoms like pursuit algorithms. Then, 
a Bayesian hypothesis test is used to estimate the activity 
measure for each coefficient separately. So, the aim of this 
paper is to estimate simple activity measures using a Bayesian 
framework. This is done by three simple assumptions which 
are needed when we devise our algorithm. These assumptions 
are just approximations and the algorithm is devised under 
these simplifying assumptions. The results of this work have 
been partially presented in [38]. 

The activity measure we obtain in this method is similar 
to what has already been obtained by IDE algorithm [36]. 
The main difference, however, is that the threshold is obtained 
mathematically and is calculated throughout the algorithm by 
some simple parameter estimation techniques. 

In this paper, we first introduce our system model and some 
notations in Section II. Then, in Section III, we propose our 
Bayesian Hypothesis Testing Algorithm (BHTA). Section IV 
investigates the stability analysis of the algorithm. Finally, in 
Section V, we investigate the experimental performance of the 
BHTA in comparison with other main algorithms. 

II. System model 

The noise vector e in (1) is assumed to be zero-mean 
Gaussian with covariance matrix a^I. In the model, the 
coefficients are inactive with probability p, and are active with 
probability 1 — p (sparsity of y implies that p should be near 
1). In the inactive case, the values of the coefficients are zero 
and in the active case the values are obtained from a Gaussian 
distribution. We call this model the 'spiky model' which is a 
special case of the Bernoulli-Gaussian model with the variance 
of the inactive samples being zero. This model has been also 
used in [39] and [7]. It is suitable for sparse representation 
of a signal where we would like to decompose a signal as a 
combination of only a few atoms of the dictionary and the 
coefficients of the other atoms are zero. So, the probability 
density of the coefficients in our problem is: 

p(yi)=p6( yi ) + (l-p)N(0,a 2 r ). (2) 
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where S(.) denotes the Dirac impulse function. In this model, 
each coefficient can be written as yt = q^i where g, is a 
binary variable (with a binomial distribution) and r t is the 
amplitude of the i'th coefficient with a Gaussian distribution. 
Each element qi is the activity of the corresponding coefficient 
(or corresponding atom): 



1 if yi is active (with probability 1 — p) 
if yi is inactive (with probability p) 



(3) 



Consequently, the probability p(q) of the activity vector q = 
{qi,q 2 , ■■■,q m ) T is equal to: 



(4) 



where n a is the number of active coefficients, i.e., the number 
of l's in q. So, the coefficient vector can be written as: 



Qr. 



(5) 



where Q = diag(q) and r = (n, r 2 , r m ) T is the 'ampli- 
tude vector' . Note that, in this paper, we use the same notation 
p(-) for both probability and Probability Density Function 
(PDF). 

III. Bayesian Hypothesis Testing Algorithm 
(BHTA) 

The main task in sparse representation algorithms is to 
determine which atoms are active in the sparse representation 
of the signal. This can be viewed as a detection task like in 
the IDE algorithm [36] which an activity function is compared 
with a decreasing threshold. In some pursuit algorithms (e.g., 
MP), it is determined by Correlation Maximization (CM). In 
some other pursuit algorithms (e.g., StOMP), it is done by 
comparing the correlations with a threshold. In the MAP sense, 
it is done with posterior maximization over all possible activity 
vectors [40]. In IBA algorithm [17], the maximization is done 
by a steepest-ascent algorithm in the M-step within a MAP 
sense framework. Here we want to determine the activity by a 
Bayesian hypothesis testing from the correlations. The possible 
strategies for determining the active atoms for the various 
algorithms are schematically depicted in Fig. l(a)-(e). 

To develop a hypothesis testing approach, we write (1) as: 



X = 'PiVi + e - 



(6) 



»=i 



where ip i is the i'th column (i.e., the i'th atom) of the 
dictionary. So, the correlations between the original signal and 
the atoms are: 



Zj =< x, (fj >= y 3 + Y yibij + vj. 



(7) 



i=l 



where bij =< ip i7 <fj > and vj =< e, tfj >, and the atoms 
are assumed to have unit Euclidean norm. 

To do a Bayesian hypothesis test based on correlations for 
determining the activity of the j'th atom, we must compute the 
posteriors p(Hi\x) and p(H 2 \z), where Hi is the hypothesis 
that the j'th atom is active and H 2 is the hypothesis that 
the j'th atom is inactive. To obtain a simple algorithm like 
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Fig. 1. Idea of detection in various algorithms, (a) MP or OMP (b) StOMP 
(c) IDE (d) MAP sense (e.g., IBA) (e) Bayesian Hypothesis Testing Algorithm 
(BHTA). 



pursuit algorithms, assuming the previous estimations of all 
other coefficients (except the j'th coefficient), we want to 
detect the activity of only the j'th atom and then update only 
the j'th coefficient. 

Since we assume that we know previous estimations of other 
coefficients, (7) can be written as: 

m m 

Zj- Y y l b ij ^y :i + Y (Vi ~ Vi) b ij + v j- ( 8 ) 

where y^ is the estimation of the i'th coefficient at the current 
iteration. Let define: 



Cj y ' i)ibij ■ 
m 

7j ~ {yi-Vi)bij+Vj. 

The two hypotheses Hi and H 2 are then: 

u fU {Hi: Zj - Cj = Tj + 7j 
Hypotheses : < TT J J J J 

{ H 2 : z 3 - Cj = lo 



(9) 



(10) 



(11) 



where Cj is known and jj is a noise or error term. In fact, 
Eq. (11) is a classical detection problem. 

A. Hard-BHTA 

In this section, we suggest a classical detection solution for 
solving the problem (11). As it was said before, the hypothesis 
test involves the computation of the overall posteriors p(Hi\x) 
and p(H 2 \z). But, with the previous formulations, we reach 
a relatively simple detection problem as in (11). For the 
simplicity of the algorithm like the pursuit algorithms, we 
rely only on the correspondent correlation (e.g., Zj) and 
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hence the simpler posteriors as p(H\\zj) and p(H 2 \zj). So, 
the hypothesis Hi is chosen when p{H\\zj) > p(H 2 \zj), 
otherwise H 2 is chosen. 

Based on Bayes' rule, the above posteriors are proportional 
to p{H 1 )p{zj\H 1 ) and p{H2)p{zj\H 2 ) respectively. The prior 
probabilities for the hypotheses are p{H\) = 1 — p and 
p(H 2 ) = P where p is defined in Section II. 

Now, for developing our algorithm, we assume the following 
three main assumptions: 

Assumption 1: (yj — yj) and (yj — yj) are assumed to be 
uncorrected for i ^ j. 

Assumption 2: The noise term Vj is uncorrelated of the 
error (y { - yj) for i ^ j. 

Assumption 3: The term jj in (10) has a Gaussian distri- 
bution. 

Strictly speaking, Assumption 1 is not mathematically true, 
because the estimated value of one coefficient clearly influ- 
ences the estimation of the other coefficients. However, in the 
following, this assumption provides a first order approximation 
of a sequence of thresholds for the algorithm, instead of using 
a heuristically predetermined sequence of thresholds as done 
in IDE [36]. More precisely, in the following, this assumption 
is used only for deriving (18). On the other hand, heuristically, 
we expect that, as the algorithm converges to the true solution, 
the outputs are closer to the true estimated values and the 
estimation of each coefficient has less influence to estimating 
the other ones. We will study this heuristic in our simulations 
(see Fig. 5). In fact, (18) requires Assumption 2, too which 
is just used here. Consequently, the experiment of Fig. 5 will 
experimentally study both Assumptions 1 and 2. 

Moreover, Assumption 3 is not strictly true, too. However, 

since jj = Y^Li^jiVi ~ Vi) b ij + v o is a sum of man y 
(especially for large m's) random variables, one expects from 
Central Limit Theorem (CLT) that this assumption be a 
good approximation. We will also study the validity of this 
assumption experimentally (see Fig. 6). This assumption will 
be used in deriving the activity measure. 

Now, let a 2 , denote the variance of -fj which is assumed to 
be a Gaussian random variable by Assumption 3. Therefore, 
the activity condition writes: 

d-f) :OXp C'^»' )> 



2ir(a 2 



73 



a?) 



"2(a 2 



2*0*, 



■. exp(- 



2< 



Simplifying (12) with the assumption that the (unknown) 
parameters p, ay and o lj are known, leads to the following 
decision rule for the hypothesis testing: 



Activity (yj) = \zj — Cj\ > Thj. 
where Thj is the threshold defined as: 



(13) 



2(<72 + <4)ln( 



2 + < 



decreasing sequence of thresholds. Improvements with respect 
to IDE method is that the value of threshold is obtained 
mathematically with respect to the parameters of the statistical 
model, i.e., following a Bayesian hypothesis test. Another 
important difference is that, IDE only uses the same threshold 
for all coefficients, while BHTA could use a different threshold 
for each coefficient. However, as we will state in Section V, 
we use the same threshold for all the coefficients to simplify 
the algorithm. 

Although (14) determines the optimal threshold, it depends 
on unknown parameters (p, ay and ct 7 ) which should be 
estimated from the original signal (x). Since estimating the 
parameters needs also the activity vector q which is derived 
itself by the value of threshold, we use an iterative algorithm. 
To estimate the parameters p, oy and ay, we can use sample 
estimate formulas, which are: 



V = 



|q||o 



*y||s 



(15) 
(16) 
(17) 



where q is obtained from the previous iteration of the decision 
rule (13). In [17], it has been proved that these estimates 
are the MAP estimation of these parameters knowing all 
other parameters. The initialization of these parameters is also 
detailed in Section V-A. 

The problem here is to estimate the parameter o lj which 
is the standard deviation of jj in (10). By taking the variance 
from (10), since Vj is a Gaussian random variable with the 
same variance as ej which is equal to a 2 , then by Assumption 
1 and Assumption 2, we will have: 



E 



b ij a i,e y - 



(18) 



where of e is the variance of the error term (yj — yj). The 
accuracy of the above formula depends on the validity of 
Assumptions 1 and 2, and will be experimentally studied in 
Section V. 

If the algorithm converges, we expect that a 2 ,, decreases. 
(12) So, we enforce the error variance to decrease geometrically: 



(n+l) _ 



= OCT, 



(n) 



(19) 



where the parameter a, less than but close to 1, determines 
the rate of convergence. 

In Appendix I, it is shown that if we choose the minimum 
£ 2 -norm solution for the first iteration, then the initial estimate 



c .u 2 (0) • 

of the variance a, „ is: 

2 (0) 



4:: = ° 2 r( E ^)+-e 2 E^- 

iesupp(y) i 



(20) 



l-p 



). (14) where * = [ipjj] = -I+* t * and L = [kj] = 2$ T -* f . The 



The decision rule and the activity function in (13) are the 
same as in IDE algorithm [36], where one uses a predefined 



notation supp(y) denotes the indices where the coefficients 
are nonzero. But, we do not know in advance where the 
nonzero elements are. So, we replace the first right side 
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term of (20) by its mathematical expectation. The expectation 
£(£ iesU pp (y )^) is equal to E^.q^) = E^(ft)# 
where q,- L is the activity of the z'th element which is a Bernoulli 
variable, and hence E(qi) = (1—p). So, the following formula 
can be used to estimate the initial parameter estimation: 



2 o 

3, e v 



^(i-p)IWl! + tff£&- (2D 



where ipj is the j'th row of the matrix 

From (21), (18) and the assumption that the error variances 
a? tends to zero at final iterations, we can find that the 

(0) 

value of varies from a large initial value = a\ + 



T,iLi,i& b ij a tel t0 a sma11 value cr - 
(14), the threshold is changed from an initial large value 

Tr4 0) = Thl (o) to a small final value Th^° o) = Thl w . 

The initial value and the final value (after infinite iterations) 

of the threshold are: 

Thf> =Th| , 0) , (22) 



,(oo) 

7j 



o . So, from 



Th^ = Th(°°) = ThL =<Te « Ka e . 



(23) 



where if = ^2 \a{^^-). In (23), it has been assumed that 
the algorithm converges to the true solution and hence u lj — > 
a e . As we can see from (22), the initial thresholds are different 
for each coefficient. But, all the thresholds are converging to 
the same value (23). 

As we explain in Section V, a common threshold is used for 
all coefficients for simplicity of the algorithm. As the value 
of threshold changes from a large value to a small value, the 
algorithm can detect more and more atoms. During the first 
iterations, the optimal thresholding strategy in (14) changes 
the thresholds very fast and then after a few iterations, the 
thresholds converge to the final small value. 

In the thresholding strategy (14), although there can be 
a simple stopping rule for iterations based on the value of 
thresholds (which will be explained in our experiments), the 
number of required iterations for convergence are not known 
in advance. So, we can use another threshold to predict the 
number of the iterations in advance. The simplest way for up- 
dating the threshold is to decrease the threshold geometrically 
from the initial value Th( 0) in (22) to final value Thj oo) in 
(23): 

(24) 



Th* (n+1) = a Th* (n) 



' 3 3 

where superscript * is for simple thresholding strategy. Since 
the final threshold is the same for all coefficients, we should 
also use the same threshold for initialization in the simple 
thresholding strategy. As we see in Section V, we also use the 
same thresholds for optimal thresholding. So, we can use the 
same initial value for threshold in simple thresholding just like 
in optimal thresholding (i.e., Thj ' = Th^ ^). So, in simple 
thresholding, the required number of iterations is: 



ln(a) ' 



(25) 



where 

Th (oo) 

is as defined in (23). In other words, using 
t iterations of (24), Th* (rl) changes from Th (0) to Th (oc) 



which were defined in (22) and (23). So, with this strategy 
of selecting the thresholds, we can predict the number of 
iterations in advance. The practical choice will be explained in 
the experimental results section. We will refer to this method 
as simple thresholding, while the straightforward method is 
referred to as optimal thresholding. The simple thresholding 
strategy is similar to IDE with the difference that here we 
know the first and last values of thresholds while IDE has no 
ideas for initial and last values of thresholds. 

After updating the activity vector based on decision rule in 
(13), the estimation of amplitude vector r which was defined 
in Section II, based on this estimated activity vector can be 
done by a Linear Least Square (LLS) estimation [43], [40]: 

f = CT^Q^T^^q^T + a 2j)-l x (26) 

where q is the estimated activity vector and Q = diag(q). 
It is worth mentioning that (26) has the same type of update 
as used in iterative re-weighted least squares algorithms such 
as FOCUSS algorithm [9]. In fact, (26) is nothing but this 
standard approach, but with a novel way for calculating the 
weights. 

B. Soft-BHTA 

In the previous subsection, we presented the Hard-version 
of BHTA. As we saw, determining the threshold is relatively 
complex. Therefore, in this section we suggest a Soft-version 
of BHTA to avoid the threshold computation. The main idea 
of soft version of BHTA is to use soft posterior probabilities as 
the soft hypothesis testing results instead of binary values '0' 
or '1' for activity measure g,. If the value of the posterior 
probability p(qi = l\zi) is high, then it means that it is 
more probable that the i'th coefficients are nonzero or the 
z'th atom is active. So, we simply replace % by p(qi = l\zi) 
as we will see at (28). At the final iteration, we use a hard 
thresholding for providing a binary value for each So, with 
this trick, all atoms are participated in the sparse representation 
in the initial iterations. Then, we replace the activity measures 
by the posteriors, which determine active (or inactive) atoms 
of the sparse representation. In this case, the z/s are the 
correlations which are used in pursuit algorithms and p(qi = 
l\zi) is the posterior of the i'th coefficient conditionally to the 
correlations. 

To compute the posteriors, we use the Bayes rule as: 

p(H 1 )p(z i \H 1 ) + p{H 2 )p{z t \H 2 ) 

(27) 

Using (1 1), in each iteration of soft version of BHTA, we must 
do the following update for the soft-activity measure: 

™ 2(^+^2) ) 

* n _^i , \2 ; ~ , _( z ._r„ .12 ; • (28) 



rrii) 2 \ ' 
2d > 



where parameters p, a r and ct 7 are obtained and updated as 
in the hard version of BHTA. After the convergence, we can 
use a simple hard thresholding as p{qi = l\zi) > 0.5 to obtain 
the active atoms. 

After updating the activity, we can use a formula similar to 
(26) for updating the amplitude vector. 
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Initialization: 



1) Initial parameter estimation: 
p(°) = 0.8, 
~ (o) _ Mb 



,(0) 



.(0) 



4Z P)ll*ill§ + <'2£ fc &- 

2) Let the initial solution from minimum ^ 2 -norm: 

y(0) = # t x 

3) Let the correlations Zj =< x, <pj > 
• Until Convergence do: 

1) cj = YJ?=\,i±j Vibij 

2) <Ji^ y •(- a<Ti, e 



3) ^ = ^ + E£i,^ ^ 

4) Th,- = - 



2 



5) Activity Detection: 

Activity ( % ) 4 q 3 = (\ Zj - Cj | > Th,) 

6) Amplitude Estimation: 

y = CT 2 Q* T ( CT 2 #Q# T + CT |i)-i x . 

7) Parameter Estimation Step: using (15), (16) and (17). 
• Final answer is y. 



Fig. 2. The Hard-BHTA algorithm. 
C. Summary 

Finally, to summarize the presentation of BHTA, we rep- 
resented the detailed Hard-BHTA algorithm in Fig. 2. Initial- 
ization is done by minimum ^ 2 -norm solution (see (34) in 
Appendix I). Updating the activity vector in Hard-BHTA is 
done by the decision rule in (13). Updating the coefficients or 
amplitudes is done by (26). Similarly, updating the parameters 
are done by equations (18), (19), (21) and with parameters 
p, <r e and (7 r computed according to (15), (16) and (17). 
Threshold determination in Hard-BHTA is done by equations 
(21), (19) and then (18) for the variance of coefficient errors. 
Then, as we explained in Section III-A, we suggested two 
different thresholding strategy which are optimal thresholding 
(14) and simple thresholding (24). We investigate these differ- 
ent strategies in the simulation results. 



then (29) is equivalent to have an input SNR greater than a 
minimum input SNR, i.e., SNR^ > SNR mi „ which is: 



SNR mm (y,j) ^ 10 log 



b 2 V I 2 

l J L^dr jr 



( p )2 _ y^ m u2 .1.2 ■ 

V (l-p)e / Z-ti=l,i^j u ij Z^rGsupp(y) Y jr 

(30) 



where this minimum SNR depends on the unknown y. For 
canceling the dependence on y, we replace the denominator 
by its expectation with respect to y (like for deriving (21)) and 
the minimum SNR becomes: 



SNR mm (j) 4 10 log 



i) 



tf+(l-p)||^.||2( 



Nll-i)- 



(31) 



where K — ( (i- p ) e ) 2 ' ^ * s ^ e f^ 1 c °l umn of B and lj is 
the j'th row of L. Although the formula for minimum input 
SNR in (31) seems complicated, it is not very restrictive, i.e., 
this minimum value is not very high. To evaluate the values for 
the minimum input SNR, we compute them for the practical 
case of CS where the real signals are sparse in DCT domain. 
In this case, the matrix $ = TD where T is the random CS 
measurement matrix and D is the DCT matrix. The random 
measurement matrix elements are drawn from a zero mean 
normal random distribution with unit variance. The columns 
of the dictionary matrix $ are normalized to have unit norms. 
A typical simulation in this case shows that the minimum and 
maximum values of SNR m j n (j) over different j's are -11.4127 
dB and 0.3560 dB. So, the practical values of SNR mi „ are not 
very high, and the sufficient condition for stability (29) is a 
weak condition and easily satisfied. 

For the stability analysis of soft-BHTA, we define the 

eX Pl 2^+^-1 ) aIld 



two terms in (28) as l\ = 

h - ^ exp( n _. 



With these definitions, l\ > l 2 is 



equivalent to % > 0.5 and li < l 2 is equivalent to < 0.5. 
It is simple to show that the condition li > l 2 is equivalent to 
\zi — mi\ > Thj which is similar to the decision rule of Hard- 
BHTA in (13). Therefore, the stability conditions for hard- 
BHTA and soft-BHTA are the same. 



IV. Stability Analysis 

Because of the thresholding strategy, a complete conver- 
gence analysis to the algorithm is very tricky, and is not 
addressed in this paper. Hence, in this section, we only study 
the stability of BHTA, i.e., if the algorithm does not diverge 
and is stable. 

The stability of the BHTA is equivalent to the convergence 
of the sequence of thresholds in (14). As we know, this is 
a positive sequence and it is bounded below by zero. So, if 
this sequence is a decreasing sequence, then it converges of 
course to the value in (23). In Appendix II, we show that with 
the assumption that cr 7i <C a r , the sufficient condition for the 
convergence of the sequence of j'th threshold is: 

2 U2 12 

a r . L^i=\,i=^j ij L^r jr 

2 ( P \2 \^rn >2 /2 ' 

a e \{l-p)e> ~ 2^i=l,i^j °ij 2^resupp(y)Yjr 

where e is the neper number, B = 4> T $, = 1 + <J>^<I> and 
L = [lij] = 2* T - * f . If we define the input SNR as in (33), 



V. Experiments 

The BHTA algorithm is investigated in this section with 
three different categories of simulation. First, in subsection V- 
A, we only consider soft and hard versions of the BHTA 
algorithm. It includes the two different thresholding strategies 
and some detailed implementation issues of the algorithm. 
Secondly, in Section V-C, comparison will be done with the 
other main algorithms for sparse representation both from 
complexity and estimation accuracy viewpoints. The perfor- 
mance of the algorithms is compared using the Signal to 
Noise Ratio between the true coefficients and the recovered 
coefficients, which is defined as: 

SNR P = lOlog d, ). (32) 

My — y II 2 

where the index o denotes output SNR. In fact, this SNR in 
the coefficient domain determines the capability of the sparse 
representation algorithm to recover the true sparse coefficients 
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in average. We define another measure which determines the 
noise level. We refer to it as input SNR: 



SNR^201og(^). 

0>. 



(33) 



This input SNR is varied from 20dB to 50dB in the experi- 
ments. 

We use the CPU time as a measure of complexity. Although, 
the CPU time is not an exact measure, it can give us a rough 
estimation of the complexity for comparing our algorithms. 
Our simulations were performed in MATLAB7.0 environment 
using an AMD Athlon Dual core 4600 with 896 MB of RAM 
and under Windows XP operating system. 



- Hard-BHTA (Optimal Thresholding, alpha.0.95) 

- Hard-BHTA (Simple Thresholding, alpha=0. 95} 

- Soft-BHTA 




40 

Input SNR (dB) 



A. Implementation issues of BHTA 

In this part of our experiments, the implementation aspects 
of the BHTA algorithm is experimentally discussed and evalu- 
ated. We mainly have two hard and soft versions of the BHTA 
algorithm, since we use two distinct methods for updating the 
threshold. 

We used a random dictionary matrix with normalized 
columns whose entries are previously drawn according to a 
uniform distribution in [—1,1]. The number of atoms is set 
to to = 512 and the signal length to n = 256. For the 
sparse coefficients, we used the model (2) with the probability 
p = 0.9 and unit variance for the active coefficients (o> = 1). 
So, on the average, about 51 atoms are active in the sparse 
representation of the signal. The noises or errors are Gaussian 
with zero-mean and different variances. The measure of per- 
formance, the output SNR (32), is averaged over 100 different 
random realizations of the dictionary, sparse coefficients and 
noise vector. 

For simplifying the algorithm, we use the same variance 
and threshold for all coefficients. This simplification reduces 
some of our calculations by a factor of ^. Since the value 
of a\ is not known in advance and the term <?U2itfi * s 



small in comparison to other term, we select a 



^(1 



f^HV^IIi- To remove the dependency on the index j, we 
select erf"' 



Fig. 3. The output SNR averaged on 100 runs versus the input SNR for 
Hard-BHTA with two different thresholding strategies and Soft-BHTA. The 
parameters are m = 512, n = 256, p = 0.9, ay = 1, a = 0.95. 



For the simple thresholding strategy (which is very similar 
to IDE), we start from the initial threshold Tlr ' to the 
final value 

Th (oo) 

by the geometric series (24). To compute 
Th*- 00 * 1 w Ka e , we need the value — . In the simulations of 
this section, we select ^ = 100 for any noise levels and 
the parameter a = 0.95 for both simple thresholding and 
optimal thresholding. Figure 3 shows the results of the two 
versions of Hard-BHTA (with the two thresholding strategies) 
and Soft-BHTA. Clearly, performance of Hard-BHTA is much 
better than Soft-BHTA performance. Of course, the optimal 
thresholding strategy (14) yields better results than the simpler 
strategy (24). 

Finally, to determine the best value of the parameter a in 
(19) and (24), we represent the results of our algorithm with 
respect to the value of a when a e = 0.01 in Fig. 4. As it can 
be seen, better results are obtained when the value is around 
a = 0.95 for optimal thresholding and for simple thresholding. 
However, we use a = 0.95 for the next experiments unless 
we state otherwise. As we can see, the Soft-BHTA and Hard- 
BHTA with simple thresholding are sensitive to the value of 
parameter a, while Hard-BHTA with optimal thresholding is 
less sensitive to this parameter. 



af(l — where the approximation 

HV'jlli ~ ^J! F i s assumed for large random matrix 
With this initialization (which is independent of the coefficient 
index) and (19), all the error variances o~i e are independent 

of the index i and assumed to be a e . So, (18) will reduce to K Investigating the assumptions 



n e 1 
use a 2 = a 2 e - 



1). To omit the dependency on j, we 



-/3(7g where (3 



l|B| 



-1 (since 



Ml 



l|B| 



is a first order approximation of E{ 1 1 b, -\ | \ } = E{ ^ F } which 
holds for random dictionaries. Finally, the values of thresholds 
are the same for all the coefficient indexes. 

The initial values of the unknown statistical parameters (p, 



ay and a e ) are p 



(o) 



0.8, «7< 0) = 



, and of> = *f 



which is similar to the initialization used in [17], [18]. We can 
propose some stopping rules for Hard-BHTA. In Hard-BHTA, 
we used Max(|Th (n+1) - Th (n) |) < as an stopping 

rule. For Soft-BHTA, a similar stopping rule is Max(|q(™ +1) - 



,(«)! 



< 



100' 



In these experiments, the Assumptions 1 to 3 of Section III- 
A are investigated. Since Assumptions 1 and 2 have only 
been used in deriving (18), the influence of these assump- 
tions is experimentally investigated by computing the absolute 
difference between both sides of (18) i.e., the error term 

1°"^ - °l - T,T=1.Mj b ij a i,e y \- Fi 8 Ure 5 sh0WS the err0r term 

over all indices 1 < j < m versus the iteration number. It can 
be seen that the averaged error term is small in comparison to 
cr^. , and vanishes after a few (5 to 6) iterations. In other words, 
as the algorithms converges to the solution, the Assumptions 
1 and 2 become very accurate. 

Assumption 3 (the Gaussianity of jj), is evaluated by 
computing the normalized Kurtosis defined as Kurt (7^) = 
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Fig. 4. The output SNR averaged on 100 runs versus the simulation parameter 
a. Other parameters are m = 512, n = 256, p = 0.9, a r = 1. 



Fig. 6. The normalized kurtosis computed over all indexes 1 < j < m 
and 100 runs of simulations versus the iteration number. The parameters are 
m = 512, n = 256, p = 0.9, oy = 1 and a„ = 0.01. 



- Averaged Error-term 

- Averaged Variance of sigma-gamma 



2 3 4 5 6 7 8 

Iteration-Number 



Fig. 5. The error term \a. 



| and variance 



is computed over 100 runs of simulations. The parameters are m 
n = 256, p = 0.9, oy = 1 and a n = 0.01. 



512, 



jpCTj — 3 [42]. Recall that the kurtosis of a Gaussian random 
variable is zero. We averaged this measure over all coefficients 
indexes j and also over runs of simulation. Averaged kurtosis 
versus iteration number is showed in Fig. 6. It can be seen 
that the value of kurtosis is small after some iterations and 
hence the assumption of Gaussianity of jj would be a good 
approximation after a few iterations. 



C. Comparison with other sparse representation algorithms 

In this experiment, we only compare the optimal thresh- 
olding version of Hard-BHTA and Soft-BHTA with other 
main sparse representation algorithms such as BP, MP, OMP, 
StOMP, SL0, BCS, GP, GPSR and IBA. In this experiment, 
we use another model for generating the sparse coefficients. 
We choose the inactivity probability p = 0.9 and all active 
coefficients are set equal to 1 instead to be distributed as a 
Gaussian random variable with a unit variance. The locations 
of active coefficients are uniformly random. The input SNR 



is defined as SNRj = 201og(^-). The comparisons are done 
in three cases. The first case is the comparison of the average 
estimation accuracy (Output SNR) versus the input noise level 
(Input SNR). The second comparison is the same measure of 
estimation accuracy (Output SNR) versus the sparsity level. 
Finally, we compare complexity of the different algorithms. 
In all experiments, the results are averaged over 100 different 
runs, with random dictionary and random sparse coefficents. 

For BHTA, we use the simulation parameters used in the 
previous experiment. BP algorithm was tested using i? 1 -magic 
package [45]. Since there are 51 active atoms in average, we 
run the MP, OMP and StOMP algorithms (implemented by 
SparseLab 2 ) for twice the number of active atoms which is 
102 (a similar strategy is used in [32] for yielding better 
performances). For StOMP, we used default parameters of 
the SparseLab code with the difference that we used similar 
number of iterations to MP and OMP (instead of 10 which is 
the default value). For SL0 algorithm 3 , we used the minimum 
a equal to a e and the decreasing factor, a parameter which 
determined a tradeoff between accuracy and speed, equal 
to 0.9. For the IBA algorithm, we used 4 iterations for 
both the M-step and the overall algorithm [17]. For GPSR 
algorithm 4 [15], we used r = 0.1||$ T x|| oo as suggested by 
the authors. We also use a debiasing step in GPSR algorithm. 
The algorithm stops if the norm of the difference between two 
consecutive estimates, divided by the norm of one of them falls 
below 10~ 4 . The other parameters of GPSR are the default 
values. We also used the recommended and default parameters 
for BCS 5 [23]. We used Sparsify toolbox for GP algorithm 6 
[32], with default parameters and we stop the algorithm if 



SolveStOMP.m are available at 



2 The codes SolveMP, SolveOMP, 
http://sparselab.stanford.edu 

3 The code slO.m is used which is available at http://ee.sharif.edurSLzero 

4 The code GPSR_fun.m is used which is available at 
http://www.lx.it.pr mtf/GPSR/GPSR6.() 

5 The codes in bcs-vb.zip are used which are available at 
http://people.ee.duke.edun ihan/cs 

6 We used the latest version of code greed_gp.m, available at 
http://www.see.ed.ac.uk/~tblumens/sparsify/sparsify.html 
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Fig. 7. The averaged output SNR on 100 runs of simulations versus input 
SNR for various algorithms. The parameters are m = 512, n = 256, p = 0.9, 
a = 0.95. 



Fig. 8. The averaged Output SNR on 100 runs of simulations versus sparsity 
level. The parameters are m = 512, n = 256, cr e = 0.01, a = 0.95. 



the mean squared error of residual is below a\. Figure 7 
shows the performance of the various algorithms (output SNR 
in coefficient domain) versus the noise level (input SNR). It 
shows that our algorithm is one of the best algorithms in terms 
of estimation accuracy specially for low noises. 

To investigate the performance of the algorithms for various 
sparsity levels, we plot (Fig. 8) the output SNR versus sparsity 
level which is determined in our statistical model (2) by 
probability (1 — p). In this experiment, we used a fixed 
number of nonzero coefficients with amplitudes equal to 1. 
The sparsity ratio is defined as ■ Again, it can be seen 
in Fig. 8 that the Hard-BHTA algorithm is one of the best 
algorithms. 

Finally, we compare the algorithm in terms of speed. Fig- 
ure 9 shows the average simulation time of various algorithms 
with respect to the dimension of our sparse representation 
problem (i.e., signal length). The dimension of our problem 
is determined with the number of atoms and the length of 
the signal. In this experiment, we used m = 2n for different 
signal lengths from 64 to 512. It shows that our algorithm is 
the most complex method. 

D. Comparison of algorithms in real-field decoding applica- 
tion 

In this section, we compare the algorithms in real-field 
coding. In real-field coding, we first encode a block vector of 
real-valued samples by a random generating matrix. Assume 
the input message vector is s = [si , s 2 , s n ] T . The encoded 
message is x = Gs where G is an n x m matrix with n < m 
(adding redundancy to input messages). Then, we assume that 
channel adds both impulse errors and a background noise. So, 
the channel output is equal toy = x + e + v where e is 
channel errors and v is the background noise. We can define 
a parity check matrix H associated to the generating matrix G 
such that HG = [3]. Then, the errors can be reconstructed 
by solving the underdetermined linear system of equations 
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Fig. 9. The averaged simulation time on 100 runs of simulations versus m 
in the case of m = 2n. The parameters are p = 0.9, <r e = 0.01, a = 0.95. 



y = Hy = He + w where w = Hv is the noise term. After 
estimating the error vector e by means of sparse representation 
algorithms, it can be subtracted from the output channel to 
yield the corrected encoded message x. Finally, the original 
messages can be recovered using s = G^x where G^ denotes 
the pseudo-inverse of G. 

The standard Lena image is used as input message. The 
pixels of image are vectorized and then divided in blocks 
of length n = 128. Entries of the generating matrix 5^ are 
also randomly selected from uniform distribution in [—1,1]. 
For channel impulse errors, we used the model (2). The 
background noise v is generated from zero mean Gaussian 
distribution with variance a^. The input SNR is defined as 
SNR, = 201og(f^). The output SNR between the original 
message s and the estimated message s is similarly defined as 
SNR = 101og(,-|M2_). We vary the input SNR from 30dB 
to 60dB. Figure 10 shows the averaged result (over 100 blocks 
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Fig. 10. The averaged output SNR on 100 runs of simulations versus input 
SNR for various algorithms in real-field decoding when impulse noise is BG 
with parameters p = 0.9 and a r = 1. The other parameters are m = 256, 
n = 128, a = 0.95. 



where H = and rj = Then, each element of the 

initial solution can be written as: 



^ (0) =^2 h ijVj 



By definition: 



Now, replacing (35) in (36) results in: 



(35) 



(36) 



7 



(o) _ 



^2 ^h ir y r b i: j+ ^ Vi b ij+ ^2 



(37) 

If we add and subtract the terms with i = j, then after some 
simplifications and calculations, we have: 



7 



(0) 



= v 3 - hi i H h 



j r y r 



hj r y r -\- 



Vj- 



(38) 



of the Lena image) of output SNR versus input SNR for BG 
model for errors. For more clarity, we only compare our BHTA 
algorithm with BR GP, BCS, SL0 and OMR and parameters 
are chosen as in the previous experiments. There are just one 
difference: we used stopTol = 10~ 6 for GP algorithm for 
achieving better results. The results show that again the BHTA 
algorithm is one of the best algorithms for real-field decoding 
application. 



VI. Conclusions 

In this paper, we proposed a Bayesian hypothesis testing 
algorithm for sparse representation problem which can also 
be used in other contexts like CS or SCA. The main idea of 
this algorithm is to use rather simple Bayesian hypothesis test 
to estimate which atoms are active in the sparse expansion 
of the signal. The activities of atoms, which are detected 
through a Bayesian test, are based on a comparison of the 
activity measure with a threshold. The interest of the Hard- 
BHTA algorithm is its ability to determine the thresholds 
mathematically with simple parameter estimation techniques 
rather than heuristically. It can be computed practically with 
simple parameter estimation techniques. The comparison of 
Hard-BHTA algorithm with the state of the art algorithms 
shows that Hard-BHTA algorithm achieves one of the best 
performances, but at the price of the highest complexity. 



Appendix I 

Initial parameter estimation for minimum € 2 -norm 
solution 

If the minimum £ 2 -norm solution is selected as the solution 
for the first iteration, then we have: 



^(°) = $t x = Hy + r 1 . 



(34) 



It leads to the following matrix form: 



7 



(o) 



v + (B - i)n - (i 



B + B T H 



H)y. 



(39) 
(40) 



with bij =< <Pi,<Pj > and Vj =< e,(fij >. Using v 
and 77 = ffr^e, then we have: 

7 (o) = * y + Le. 

where L = * T + (B and * = B T H + B + H - I. 

Using B = * T * and H = we have L = 2$ T - *t 

and * = I + H. Finally, (40) results in (20). 

Appendix II 

Sufficient condition for stability of Hard-BHTA 

For small values of o lj in comparison to a r , it can be seen 
that the threshold is proportional to: 



I'll, cx (T. 



Therefore, if we define x 



ln( 



P Pr 



1 — p a 



■)■ 



(41) 



1i 



and c = -r—G r , then 
we should investigate monotonicity of the function f(x) = 
xy / ln(^). Using the derivative of this function, it can be 
seen that the function is decreasing for x < -. This means 



al. < k 2 al where k 



v 



(l-p)e 



. So, we should have: 



^2 1 \ " 1 2 JW , 7,2 2 
< J e+ 2^ b ij a < k a v 



(42) 



Then, for the next iteration, it is obvious that the above 
condition is satisfied because if a < 1 then: 



"ij" i,e 



(43) 



We use the condition in (42) at initialization as the sufficient 
condition for a decreasing threshold. Replacing the initial 
variance of (20) in (42) for n = 0, then after some simple 
manipulations, leads to the sufficient condition (29). 
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